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Abstract 

We propose a strategy to achieve the fastest convergence in the Wang-Landau algorithm with 
varying modification factors. With this strategy, the convergence of a simulation is at least as good 
as the conventional Monte Carlo algorithm, i.e. the statistical error vanishes as 1/Vi, where t is a 
normalized time of the simulation. However, we also prove that the error cannot vanish faster than 
1/t. Our findings are consistent with the 1/t Wang-Landau algorithm discovered recently, and we 
argue that one needs external information in the simulation to beat the conventional Monte Carlo 
algorithm. 
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I. INTRODUCTION 



Many researchers in various fields have been interested in using and improving the Wang- 
Landau(WL) algorithm. PQ |2] This algorithm is in fact a general scheme to tackle problems 
that can be transformed into the problem of estimating a probability distribution. This 
probability distribution can represent the simple density of states as a function of energy 
only, [H, [21 [31 m [5j but the more interesting and challenging applications are the joint density 
of states as a function of energy and a second variable. [H [71 [H [9l [TOl [Til [I2l [111 HI] The second 
variable can be the volume of a liquid/gas system, the reaction coordinate of a polymer or 
a protein molecule, or the magnetization of a spin system. The (joint) density of states 
are subsequently used to calculate the partition function, but it is also doable to calculate 
the partition function directly with the WL algorithm without going through the density of 
states. |l5j Apart from these physical models, several authors have explored the possibility of 
performing numerical integration with the WL algorithm. flQ{ [T7] In contrast to alternative 
sampling methods, such as the umbrella sampling|18j and multicanonical algorithm, (TH] the 
WL algorithm almost seems to be a universal approach, because it does not rely on working 
out a good range of energy or distribution function to sample in advance. However, most 
implementations of the WL algorithm involve some "proprietary" enhancements designed 
to best suit the problem being studied. Some general strategies to improve the algorithm 
for a large class of problems have also been proposed. [HI (201 [211 122] 

The original WL algorithm measures the histogram of the simulation, as soon as the 
histogram achieves a certain "flatness" , which is a tunable parameter in the simulation, the 
modification factor is reduced to fn+i = vTn- Many possible modifications have already 
been suggested and tested empirically. One type of modifications involves the flatness cri- 
terion. Although to achieve a fiat histogram is the initial motivation of the WL algorithm, 
Ref. [23 suggested that the flatness is not a necessary criterion to achieve convergence. Usu- 
ally, the flatness is defined as the ratio of the fluctuation of the histogram (or other equivalent 
quantities) to the average histogram. In fact, fluctuation of the histogram is an intrinsic 
property of the WL algorithm. A simulation with constant / always improves the flatness, 
due to the increasing average histogram, but the amplitude of fluctuation eventually satu- 
rates. The prediction in Ref. |23] that the fluctuation of histogram is proportional to l/i/ln / 
has been verified by different authors independently. f^M [2S| Therefore, many authors often 
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replace the flatness criterion with that of a minimum histogram. This also suggests that one 
should instead focus on the fluctuation of the histogram rather than the flatness. Actually 
the average histogram is absorbed into the normalization constant of the density of states, 
so the flatness is not a good indicator of convergence. 

On the other hand, the exponential reduction of / mainly came out of the expectation 
to achieve an exponential convergence. Several authors have found the error of the original 
WL algorithm to saturate, while relaxing the exponential reduction rule seems to offer a 
better convergence and accuracy. [251 12H1 [221 EO] We mention in passing that this type 
of error saturation is different from the ^/ln f statistical error for a fixed /. Ref. [2^ and I5U1 
clearly illustrate that even if / is reduced to a very small value according to the original 
prescription, the statistical error stops to decrease at a certain point. This phenomenon 
suggests that exponentially reducing / is not the best strategy. Intuitive strategies to adjust 
/ have been proposed, such as increasing / once in a while during the simulation. [28] The 
most impressive improvement has been observed for the simple 1/t rule. [221 ED] However a 
generic principle for achieving the optimal convergence is unknown. 

The purpose of this paper is to derive such a principle to adjust / in the simplest and 
most generic setting. The very first question one may ask is, what is the simplest and most 
generic setting of the WL algorithm? The literature so far is not clear on this issue. With 
implementation of improved algorithms mixed with unexplored models in different fields, 
the results are usually a convoluted mixture of model-specific properties and the generic 
behavior of the WL algorithm. One may suggest that the Ising model is a generic model 
to start with. However, it is easy to see that in terms of measuring the performance of the 
algorithm, the Ising model is not different from simply counting the number of states with the 
same total spin. [23] If we flip a single spin in each update, the total energy or total spin can 
only change by a small amount every time. The WL algorithm compares g{Ei) and g{Ef), 
i.e. density of states before and after a move, but it does not require Ei to be close to Ej. 
In principle, one can implement algorithms that allow "non-local" moves in the parameter 
space, e.g. cluster algorithm. The update schemes for the underlying model certainly have 
an effect on the outcome, which also makes rigorous analysis almost impossible. This is an 
effect that we want to factor out. So what models are more generic than the Ising model? 
Numerical integration [151 ttZl ED] is a better candidate, since the evaluation of the integrand 
and moving from one sampling point to the other are both fast. The time spent in selecting 
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the next sampling point is thus neghgible. Selecting the next sampling point X, may it be 
an energy, a 2-dimensional vector, or a value of the integrand, is an encapsulated process 
that depends on the actual problem to solve. What the high-level WL "driver" asks for is 
only the next sampling point with probability distribution P{X) oc g{X)/g{X), where g{X) 
and g{X) are the exact and estimated density of states respectively. 

This is in fact the starting point of analysis in Ref. [231 which mainly dealt with the 
dynamics of the simulation with a fixed /. This fluctuation leads to a statistical error in the 



density of states proportional to y/\n f. The multistage WL algorithm partially reduces this 
statistical fluctuation by decreasing /. However, its overall efficiency in reducing statistical 
error is not necessarily superior than the conventional Monte Carlo algorithm. [29l [30] This 
is why Ref. [231 suggests to average over multiple independent simulations with a single /. 

In Secjll] of this paper, following the same approach, we derive an optimal strategy for 
updating the modification factor; in Sec. [IIH we derive the upper and lower bound of 
the convergence with this optimal strategy, and compare it to the very impressive 1/t WL 



algorithm; Sec. IV presents our conclusion and discussions. The method we use in our 
study is entirely different from the well-known argument based on detailed balance. In 
fact, detailed balance applies to simulations with a set of constant transition probabilities. 
Most sampling algorithms fall into this class, including all importance sampling algorithms. 
The basic Metropolis algorithm|31j is indeed the "driver" of them. However, the transition 
probability in the WL algorithm is continually updated, therefore detailed balance does not 
directly apply to it. What we focus on in this paper is the WL "driver". Similar to detailed 
balance, its mathematical validity warrants many applications, although not all physical 
models with specific pitfalls are guaranteed to be solved. Therefore we only demonstrate 
our strategy on the most popular testing case, i.e. 2-dimensional Ising model. We believe the 
strategy we found is simple to implement in any existing simulations with the WL algorithm, 
and that its performance is comparable to the 1/t WL algorithm. 



II. OPTIMAL MODIFICATION FACTOR 



The WL algorithm is not a Markov process, because the transition probability depends 
on the history of the simulation. It might be necessary to re-iterate the main idea of Ref. [231 
Suppose there are N bins (energies) in a generic simulation, the key random process to focus 
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on is the probability distribution Pi{t) of the next move, with i = 1, . . . , N. If bin / is picked, 
this probabihty distribution evolves as: 

The transition probability from p{t) to + 1) is just pi{t). The process p{t) is indeed 
a Markov process. p{t) lives as a vector inside an iV-dimensional simplex, whose center is 
the uniform distribution = for 1 < / < A^. Convergence means that the estimated 
density of states forces p(t) to approach p. 

In Ref. [231 the level of convergence is measured by the function: 

N 

fi{t) = NlnN + J2^^Piit)- (2) 

i=l 

If the simulation has reached the exact density of states, therefore, a "flat" histogram, then 
p{t) = p, and fi{t) =0. In a simulation, always starts from a negative value and 
approaches from below. The expected increment in is given by 



E + 1) - /i(t) \J^,] = -lnf + Nj2 Pk{t) In ^ _ _ ■ (3) 



N 

V,, (^-\^r. 

Here the "filtration" jFj is a Borel algebra which represents the information available at 
time t. p{t) and /i(t) are supposed to be available at time t, the expectation is over 
different fi{t + 1). It was proved in Ref. |23] that if the distance between p{t) and p is 
sufficiently large, fi{t + 1) increases on average. Thus p serves as an attraction center that 
pulls p{t) towards it. However, this attraction has a short-range cut-off at a characteristic 
distance C(/) = ^[{1 - f-^)-^\n f - 1]/N. At the same time ({f) determines how much 
the estimated density of states fluctuates around the exact density of states. Obviously the 
amount of fluctuation is reduced by decreasing / in the simulation. 

In practice, a systematic error in the simulation exists, which is a function of / and the 
auto-correlation between successive additions to the histogram. As observed in Ref. [231 this 
systematic error becomes smaller when either / or the correlation decreases. The amount of 
correlation is reflected by the tunneling time, 132] and Ref. [33] offers a systematic way to 
decrease the tunneling time for one-dimensional density of states. But in general, decreasing 
the correlation (tunneling time) in the WL algorithm, especially for those calculating 2- 
dimensional joint density of sates, is not an easy task. This is why it is necessary to 



reduce / to achieve accurate results. Otherwise, averaging over independent simulations as 
suggested in Ref. [23] would be the best solution. In a multistage simulation, one has the 
freedom to chose fn+i- We realize that there is an optimal / in the sense that it maximizes 
Eq. (jsj). Selecting this optimal value for fn+i will bring fi{t) to zero at the fastest speed. 
An intuitive analogy is the random deposition model, where we would relate / to the size 
of the next particle to be deposited. If / is too small, it takes many such particles to fill up 
the existing depressions in the landscape. If / is too large, instead of filling up the dips, the 
new depositions actually create a more hilly landscape. 

To find this optimal /, we define y = 1 — and rewrite the expected change Eq. ^ as 

^ 1 

G{p{t),y) =ln{l-y) + Nj2 Pk{t) In ^_ . (4) 

k=i ^'^^ 



The Taylor expansion of this function 



+00 / TV \ 
n=l \fc=l / 

is easier to analyze. The coefficient of the first linear term is equal io N\\pk — P\\^ ■ It is 
positive except when the simulation has perfectly converged. The higher order terms are 
negative if Pk{t) is in the vicinity of p, which is the region that we care about most. In fact, 
when p{t) ~ 1/N in this region, the coefficient of nth term is roughly (A^^"+^ — 1)/Ti, which 
is negative for n > 1. (Assuming is a large integer.) The function G{p,y) thus has a 
unique maximum value for y G [0, 1). Suppose this maximum value is achieved at yc, then 
the corresponding fc = 1/(1 — ?/c) is the optimal modification factor that brings fi(t) towards 
zero at the maximum speed. 

In the refining stage of an actual simulation, y is close to zero, since / is close to 1. In the 
following we will also refer to y as the modification factor for convenience, as it is unlikely 
to confuse it with /. It should be sufficient to retain two terms of the function G{p,y): 



G{p,y)^N\\p-pfy-^ (^^ - nJ^pI^ ■ 



(6) 



With this approximation, the optimal value for y is obtained as 



mv{t)-pr 
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On the denominator, since Pk{t) is of order in the cases that we are interested in, the 
second term is of order For large systems, this correction can be safely omitted as well. 

The nominator in Eq. ([T]) is simply a measure of the fluctuation of the estimated density 
of states, which is in agreement with our earlier analogy to the random deposition model. 
"The optimal size of the deposition", i.e. fc ~ 1/(1 — N\\p{t) — is determined by the 
current roughness of the landscape. 

We illustrate this optimal strategy with the traditional testing case, the 2-dimensional 
Ising model. As suggested in Ref. [231 we update the estimated density of states once every 
K flips, so that the random walker has a chance to travel around between two updates in the 
estimated density of states. We arbitrarily set K = N/2 and the initial modification factor 
In/o = 0.1, but we have tested that the simulation is stable with other reasonable choices. 
Once all the energies have been visited, we start updating the modification factor. Since the 
exact density of states is known, p(t) can be directly calculated as pi(t) = Z^^g{Ei)/g{Ei), 
where Z is a normalization constant. The modification factor is updated with 

\nf = aNj2iPiit)-yNy, (8) 

I 

where < a < 1, and we have used a = 0.1. [Note that only the first order term in the 
Taylor expansion of Eq. ([s]) matters. This choice is also convenient since in actual simulations 
only In g{E) is used.] The simulation continues with the updated / until all energies have 
been visited again. We again update / with Eq. ([s]) and run until all energies are visited. 
This cycle repeats until the desired accuracy in density of states is reached, which is also 
reflected in vanishing In/. 

The choice of a requires some justification. The simulation of Ising model, auto- 
correlation between updates in g{Ei) indeed exists, leading to the systematic error as dis- 
cussed in Ref. |23l If we look at the error In {g{E)/g{E)), we see the same slow-varying 
component in different simulations even the random number seeds are different. The high- 
frequency fluctuations are indeed random. However, in our simplified case for analytical 
analysis, all the bins are equivalent and independent, which leads us to Eq. ([T]). If we sim- 
ply apply Eq. ([t]), \\p{t) — p|P contains contributions from both this systematic error and 
the uncorrelated statistical fluctuation. As a result, convergence is not guaranteed. Ideally 
one should extract the uncorrelated statistical fluctuation, which can be estimated from 
the high-frequency fluctuations of g{E), assuming g{E) is a sufficiently smooth function. 
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However, the easiest fix that we have discovered is to insert a small positive number a in 
Eq. (§. 

On a second thought, one notices that the systematic error is brought into — p|P 
only because the exact density of states is known and used in our test. If it is not known 
in advance, only the uncorrelated statistical fluctuation can be properly estimated from 
g{E). One can only hope the systematic error to vanish as / decreases, or identify it with 
independent means. To guard against possible overestimation of the fluctuation, we suggest 
to use a as an tunable parameter in all simulations. Thus, there are only three configuration 
parameters, /o, K, and a. /o is a trivial one; both K and a reduce the systematic error 
from auto-correlation in the sampling. After setting these parameters, the simulations is 
worry-free. If the auto-correlation turns out to be a problem, one can either increase K or 
reduce a to fix it. 

Figure [l](a) plots the vanishing fluctuation observed in several simulations and the typical 
pattern of error. These simulations were performed on square lattices with periodic boundary 
conditions. The linear size L that we used ranges from L = 8 to L = 32. The exact density 
of states are calculated with the algorithm in Ref. [311 The fluctuation in p{t) generally 
vanish as 1/t, where t is the number of cycles. Since the simulation visits every energy at 
least once in each cycle, t is approximately proportional to the total number of steps in 
the entire simulation. The / in these simulations also decreases roughly as 1/t since it is 
proportional to the fluctuation. Figure [T]^b) shows that the typical shape of the error at the 
beginning of the simulation and the final error. The initial error is shaped like a camelback, 
but its slow-varying component is almost completely removed after 200 cycles. 

A remaining question is how to estimate Pk{t) and its distance to the uniform distribution, 
when the exact density of states is unknown. A simple approach is to to keep a short 
histogram Hi without modifying the estimated density of states, i.e. setting / = 0. Then 
Pkit) is estimated to be 

Due to the discrete nature of the histogram, in order to estimate p{t) to sufficient accuracy so 
that the calculated yc is not dominated by the sampling error of the histogram, the average 
height of the histogram has to grow larger as p{t) approaches p. This does not seem to 
be efficient in the long run. One can also run multiple simulations with different random 
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FIG. 1: Test with 2- dimensional Ising model, (a) (color online) The reduction of fluctuation, two 

runs of L = 32 with different random number seeds are presented. The fluctuation decreases as 

1/t in the long run. (b) Comparison of error in estimated density of states at different times. The 

systematic error is clearly visible in the upper panel at t = 2, while in the lower panel at t = 200, 

the error is dominated by uncorrelated statistical fluctuation. 
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number seeds but the same sequence of modification factors, the variation in the estimated 
density of states is a measure of the statistical fluctuation \\p(t) —pW^. Anyway, estimating 
the statistical fluctuation can be a time-consuming task in reality. 

A better approach is to adjust / according to a rule that is consistent with the natural 
dynamics of fi{t) governed by the optimal modification factor, so that extra simulation steps 
to estimate \\p(t) — p|p can be avoided. The fluctuation of p(t) has A^ — 1 degrees of freedom, 
where A^ is usually a large integer, its distribution should resemble a narrow Gaussian peak 
centered on its mean value. Therefore, a natural choice is to replace \\p{t) — p|p in Eq. ([t]) 
with its expectation E[||p(t) — which we expect to depend only on the current / and t. In 
fact the test with Ising model suggests that the fluctuation converges roughly as 1/t and one 
may simply set \nft = t~^ln/o, which is in agreement with the 1/t WL algorithm. [29| [30] 
Why this power-law behavior appears and what are the possible values of its exponent will 
be investigated in the next section. 

III. ASYMPTOTIC BEHAVIOR WITH THE OPTIMAL / 

The instantaneous fluctuation \\p{t) — is a random process. With the optimal modi- 
fication factor, it is expected to converge in probability: 

lim P{\\p{t)-pf > e) = 0,Ve > 0. (10) 

t— >+oo 

Due to the Markov inequality, it is a corollary of 

lim E[\\p{t) -pf] = 0. 

t— ++00 

An upper bound for E[||p(t) — is sufficient to prove this convergence, however, we need 
both tight upper and lower bounds to infer a useful rule of adjusting the modification factor. 

A. Upper bound 

For simplicity, we approximate G{p, y) as 

G{p,y) = N\\p-pfy-]^y\ (11) 

The omitted terms are of order 0{y'^/N). Once we adopt the optimal strategy, Eq. ([s]) 
becomes 

E[^{t + = /.(t) + lN'\\pit)-pr. (12) 
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It seems with the optimal modification factor, yu(t) still increases just by a small amount on 
average. We notice that in the vicinity of p, i.e. Npk(t) — 1^1, 

Mt)^-V|b(t)-pf. (13) 

Thus, let S{t) = \\p{t + 1) — pp, we can approximately write in this region: 

E[S{t + l)\J^t] = Sit)-S{tY. (14) 

Both sides of this equation are jF^-measurable random variables. One is tempted to replace 
the left side simply with S(t + 1), and hand-wavingly deduce that S(t) ~ 1/t for large t. 
We approach more carefully here, since S{t) is indeed a random process. Suppose we start 
with ^(O) G (0, 1/2), the quantity that we are interested in is E[S'(t)|JFo], which cannot be 



directly calculated with Eq. (14). To estimate its behavior for large t, we define an auxiliary 
sequence 

Zo = S{0), (15) 
Zt+i = F{Zt) = Zt- Zl (16) 

and prove that E[5'(t)|jFo] < Zt. Obviously, > and Zt+i < Zt, for all t = 0, 1, ■ ■ ■ , 
therefore, lim„^oo Zt = 0. It is also easy to see that Zt ^ 1/t as t ^ +oo, just by observing 
that — Zt^'^ = 1/(1 — Zt), which monotonically converges to 1. Next we use induction 
to prove E[S'(t)|jFo] < Zt. This proposition is true for t = 0. Suppose it is true up to t = n, 
for t = n + 1, we have 



E [Sin + 1) 1 1 J-q] = E [E[Sin + 1) |^o] 

^ Zn ~ Zn = Zn+l, 



<E[S{n)\J^o]-^[Sin)\J^o'^ 

72 



where we have used the Jensen inequality and the fact that F{x) is increasing and concave 
in (0, 1/2). Thus, we have proved that the long term behavior of E[S'(i(:)|jFo] is bounded by 
1/t. If we assume that the error of a stochastic algorithm cannot vanish faster than t^^/^, 
we can already conclude that E[S'(i(:)|jFo] ~ 1/t, since the error in estimated density of states 



is proportional to a/ S{t). Furthermore, the modification factor should decrease as ~ 1/t. 
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B. Lower bound 



Although now we have an upper bound for the asymptotic behavior of E[S'(i)|jFo], we 
cannot rule out the possibility of an exponential decay. Consider a random process defined by 
= Ct{St - S^), where Ci = or 2 with probability 1/2, and So E (0, 1/2). Then P(5„ > 
0) = 1/2", and since the function g{x) = 2(x — x^) has a fix point at 1/2, E[S't|jFo] decreases 
as 1/2*+^. Therefore, the distribution P{S(t + l)\J-'t) has to meet certain requirements for 
the sequence E[S'(t)|jFo] to decay as a power-law. 

To rigorously prove that E[S'(t)|JFo] does not decrease exponentially, our approach is to 
study the evolution of a super-optimized version of the WL algorithm. At time t, if the Ith 
bin is selected, the change in the measure /i(t) is 

A/i«(t) = ln(l - y) + Ann (17) 

Here we first apply the optimal modification factor y = N\\p{t) — p|P = NS{t). In order 
to make the fastest convergence, we can in principle let the computer choose the bin that 
gives the maximum increment in /u(t). One way to do this is to estimate p{t) and choose the 
bin with largest pi{t) in every step, which is of course very difficult in reality. Actually, for 
models such as an Ising model, one cannot simply jump to the energy bin with the largest 
Pi{t), but has to update the energy with a random walk. However, it is all right to perform 
this "gedanken" simulation just for the sake of our analysis. In fact, when this choice is 
made, the simulation evolves deterministically. 

fx{t + 1) - At(t) = 

^..[Hl-NS(t)) + mn^-^±^] (18) 
< ln(l - NS{t)) 



-A^lnU 



1/A^ + y^iN - l)5(t)/Ar] NS{t)j 



The above inequality uses the fact that the maximum possible value for pi{t) subject to the 
constraint that \\p(t) - pf = S{t) and Y.iPii't) = 1 is 1/A^ + \/{N - l)S{t)/N. Similar to 
the earlier derivation, we replace /x(t) with —N'^S{t)/2, and arrive at 



S{t + 1) > S{t) - 2y ^^LJis'^/^t) - 0{S{tfN-^). (19) 
Now assume the auxiliary sequence Zt is defined by Zt+ 1 = Zt- 2Zf\ and Zo = S{0). 



When 5(0) is sufficiently small so that the higher order terms in Eq. (19) can be safely 
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ignored, we can easily see that S{t) > Zt. On the other hand, hm(_»oo Zft"^ is finite, i.e. 
Zt ~ t^^. Clearly, even in the hypothetical super-optimized case, S{t) is bounded from 
below by therefore, E[S'(t)|jFQ] can only vanish algebraically with an exponent in [1, 2). 
A hand-waving argument for the lack of exponential convergence is that in each step, the 
reduction of S{t) never has a linear term in S{t), but is always dominated by a power S^{t), 
with /9 > 3/2. 




FIG. 2: Simulated sequence of S{t) generated by the super-optimized algorithm and the auxiliary 
sequence Zf. The dashed straight hue is a guide for the eye, representing the behavior. 



Although this "gedanken" simulation is not feasible for a meaningful physical model, it 
is trivial to numerically test the super-optimized algorithm with a small program. We have 
simulated the evolution of a vector p(t) according to this deterministic evolution and the 
result is plotted in Fig. |2] The initial value is set at pi{0) = 1 /N + ei, where e/ are small error 
terms whose sum is zero. The total error S{t) = ||e(t)|p does vanish as asymptotically. 

Note that the optimal / applied above is based on the assumption that the next bin 
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will be drawn randomly with probability p{t). However, we eliminate the randomness after 



choosing /. With a second thought, if p{t) is known, we can maximize Eq. (17) for both y 
and / and obtain 

Af^{t)<^N{N-l)S{t). (20) 

The desired linear term S{t) appears, which leads to an asymptotically exponential conver- 
gence. However, this "ultra-optimized" algorithm is not feasible in actual simulation for the 
same reason given above. In fact, if we precisely know p{t), there is no need to calculate it 
with a simulation. The uncertainty in the estimated p{t) always results in an average over 
a certain probability distribution. 

Now we have sufficient reason to believe that in the actual simulation S{t) decreases as 
1/t" with 1 < a < 2. Correspondingly, the optimal modification factor is roughly given by 
1 + NS{t), or In/ ^ NS{t) ~ 1/t". Since the actual simulation is stochastic, and there 
are other inefficiencies that we have ignored in the above analysis, it is likely to be safe to 
choose a = 1. This choice is consistent with the recently discovered 1/t WL algorithm, 
which achieves an impressive 1/t^/^ convergence in error. 

C. Numerical simulation 

As we stated earlier, S{t) is expected to be distributed in a narrow peak around its 
mean. We can write S(t + 1) as S(t + 1) = Q {S{t) — S'^it)), where Q is a random variable 
satisfying Ct > and E,[Q] = 1- We have given an example earlier that (t can be chosen so 
that E[S'(t)|jF] decreases exponentially. It is interesting to ask the condition that (t must 
satisfy so that E[S'(t)|JF] still decreases as 1/t. We have performed numerical experiments to 
investigate the nonlinear stochastic evolution of S{t), assuming Q are independent lognormal 
random numbers, i.e. Q = e""/^"*"^^, and z is A^(0, 1), or uniform random numbers with a 
box distribution, i.e. -P(C) = a^^l(l — a/2 < Ct < l + a/2). Several averaged trajectories for 
S{t) with different u and 5'(0) = So are plotted in Fig. [s] They indicate that if the variance 
of (t is small and t is sufficiently small, then E[S'(t)|jF] decreases as l/(t + 1/5(0)), but 
once e"* ^ 1, S{t) starts to decrease exponentially with large fluctuations. The simulations 
with box distribution for Q exhibit the same behavior. These simulations are also numerical 



evidence of the 1/t upper bound that is proved in Sec, HI A 



For small u and 5*0, we can approximate the stochastic difference equation of the process 
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FIG. 3: (color online) Average trajectories of S{t) with different initial condition S'(O) and the 
diffusion constant u. Every curve is the average of 1000 samples. These curves follows l/(i + Sq^) 
for small t, but crosses over to exponential decay at large t. 

S{t) with a continuous time stochastic differential equation: 

dS(t) = -S\t) + V^S{t)dBt, (21) 



where Bt is the standard Brownian motion. This equation has an analytic strong solution: 

Sit) = ^ , (22) 

l/So + JiUis)ds' 

where 5(0) = 5*0 and U(t) = exp {—ut/2 + y/uBf) is the standard exponential martingale of 
the Brownian motion. The expectation value of S{t) is not straightforward to calculate, but 
E[l/S'(t)] can be quickly calculated using the properties of exponential martingale U{t): 

^ 5(0) " — ■ '''' 
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For small t and u, by applying Taylor expansion to Eq. (22) and taking expectation term 
by term, we have obtained 



ns{t)] = So 



1 + j2{-rs^E 



n=l 



U (Si) ■ ■ ■ U {Sn)dSi ■ ■ ■ dSr^ 



So " Sq 



(1 + Sos) 



where we have also carelessly exchanged summation and expectation, and used the fact that 
E,[U{si)U{s2) ■ ■ ■U{sn)U{t)] = e"'^" when si < S2 < ■ ■ ■ Sn < t. Obviously, this expression 



breaks down when the integral is larger than I/Sq. However, both Eqs. (23) and (24) are 
consistent with the fact that the deterministic evolution S(t) = So/ {I + Sot) is stable in 
the sense that for arbitrary t, one can find a u small enough so that the error at time t 
is smaller than any given amount. The noise controlled by u makes S{t) fluctuate around 
this deterministic evolution when u and t are relatively small, but when e"* ^ 1, the S{t) 
becomes very noisy and E[5'(t)] decreases exponentially. 

Since in the actual simulations, the error has been observed to decrease as 1/ \/t seemingly 
forever, the above model with constant u is probably not good enough for t ^ The 
effective u in real simulation should decrease as S{t) decreases. 



IV. CONCLUSIONS AND DISCUSSIONS 



We suggest that the modification factor / of the WL algorithm should be chosen to 
maximize the expected increment of the convergence measure fi{t), and that the resultant 
optimal modification factor is proportional to the fluctuation in the estimated density of 
states. With this optimal strategy, we have proved that the best behavior of the statistical 
error is bounded by and 1/t, where the latter is deduced with the help of an "gedanken" 

super-optimized WL algorithm. It is clear that the WL algorithm never converges expo- 
nentially, roughly speaking because the reduction in the fluctuation S(t) is proportional to 
S^{t) with 3/2 < j3 < 2. Since the optimal modification factor is proportional to S{t), one 
should never decrease the modification factor exponentially. 

Our estimation of the convergence is consistent with the recent numerical investigation of 
the 1/t WL algorithm. We have proved that if the optimal modification factor is adopted, 
In/ oc with 1 < a < 2. Actually, all existing numerical evidence suggest that in 
fact a = 1 is the best case for all pure Monte Carlo algorithms. A test with the proposed 
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strategy applied to Ising model also exhibits the 1/t convergence. Numerical simulation of 
the process S{t + 1) = Ct{S{t) — S'^(t)) indicates that a = 1 is stable when Q has a small 
variance. Our proof of the lower bound suggests that only if deterministic moves are mixed 
with the pure Monte Carlo algorithm, can a simulation achieve a > 1. Deterministic steps 
are not free, they need additional information about the density of states. Using a suitable 
initial estimation is one form of injecting information into the simulation. If new pieces 
of information are available during the simulation, it is possible to use them to modify 
the estimated density of states on the fly. The fact that the super- and ultra-optimized 
deterministic algorithms have a faster convergence than the pure Monte Carlo algorithm 
is reminiscent of a wide variety of numerical integration algorithms, which converges at 
different speeds depending on the level of stochasticity in the algorithm. 

Our results could have been a motivation for the 1/t WL algorithm. However, we have 
not proved the converse proposition that with the 1/t algorithm, the convergence in error is 
1 / ^/i. Since our optimal / is a random process that actually fluctuates around 1/t, we expect 
the fluctuations to cancel out in the long run. Some approximations have been adopted in 
order to add to the readability of this paper. All the derivations presented here can be 
put into more rigorous formats. Finally, we stress that the focus in our study in this paper 
is the dynamic evolution of a random process extracted from the Monte Carlo simulation. 
This approach should be complementary to the traditional view that emphasizes on the 
stationary distribution and detailed balance of a Markov process. We hope this strategy 
and other techniques introduced here are of some general interest to research in Monte 
Carlo algorithms, especially those non-Markovian algorithms. [35] 
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